Deep Mean Field Theory: Layerwise Variance
Abstract
A recent line of work has applied mean field theory to the statistical properties of neural networks with great success, making and verifying very precise predictions of neural network behavior and test-time performance. In this paper, we build on these works to explore two methods for taming the behavior of random residual networks (with only fully connected layers and no batchnorm). The first method is width variation (WV), i.e. varying the widths of layers as a function of depth. We show that width decay reduces gradient explosion without affecting the mean forward dynamics of the random network. The second method is variance variation (VV), i.e. changing the initialization variances of weights and biases over depth. We show that VV, used appropriately, can reduce the gradient explosion of tanh and ReLU resnets from exp(Θ(√L)) and exp(Θ(L)), respectively, to constant Θ(1). A complete phase diagram is derived for how variance decay affects different dynamics, such as those of gradient and activation norms. In particular, we show the existence of many phase transitions where these dynamics switch between exponential, polynomial, logarithmic, and even constant behaviors. Using the obtained mean field theory, we are able to track surprisingly well how VV at initialization time affects training and test performance on MNIST after a set number of epochs: the level sets of test/train set accuracies coincide with the level sets of the expectations of certain gradient norms or of metric expressivity (as defined in Yang and Schoenholz (2017)), a measure of expansion in a random neural network. Based on insights from past works in deep mean field theory and information geometry, we also provide a new perspective on the gradient explosion/vanishing problems: they lead to ill-conditioning of the Fisher information matrix, causing optimization troubles.
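The gradient-explosion rates quoted in the abstract can be probed numerically. The sketch below is illustrative only and not the paper's code: it pushes a random cotangent vector through the transposed Jacobians of a random fully connected ReLU resnet at initialization, comparing a constant weight variance against a simple 1/(l+1) variance decay. The function name `grad_growth`, the decay schedule, and the depth/width choices are ours; the paper derives the exact schedules that achieve Θ(1) gradients.

```python
# Illustrative sketch (assumed parameterization, not the paper's exact one):
# estimate gradient amplification through a random ReLU resnet
#   x_{l+1} = x_l + W_l relu(x_l),   W_l ~ N(0, sigma_l^2 / N),
# by propagating a random cotangent through J_l^T = I + D_l W_l^T,
# where D_l = diag(relu'(x_l)).
import numpy as np

def grad_growth(depth, width, sigma2_fn, seed=0):
    """Return ||g_0|| / ||g_L||, the total backward amplification."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    layers = []
    for l in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(sigma2_fn(l) / width)
        layers.append((W, x > 0))          # store weights and ReLU mask D_l
        x = x + W @ np.maximum(x, 0.0)     # residual forward step
    g = rng.standard_normal(width)         # random cotangent at the output
    g0 = np.linalg.norm(g)
    for W, mask in reversed(layers):
        g = g + mask * (W.T @ g)           # apply J_l^T = I + D_l W_l^T
    return np.linalg.norm(g) / g0

L, N = 50, 256
const = grad_growth(L, N, lambda l: 1.0)            # constant sigma_l^2 = 1
decay = grad_growth(L, N, lambda l: 1.0 / (l + 1))  # decaying sigma_l^2
print(const, decay)
```

With a constant variance, the squared gradient norm grows by roughly a factor 1 + σ²/2 per layer (half the ReLU units are active on average), i.e. exponentially in L; the decaying schedule shrinks the per-layer factor toward 1, giving only mild growth.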
Similar resources
Multiscale Analysis of Transverse Cracking in Cross-Ply Laminated Beams Using the Layerwise Theory
A finite element model based on the layerwise theory is developed for the analysis of transverse cracking in cross-ply laminated beams. The numerical model is developed using the layerwise theory of Reddy, and the von Kármán type nonlinear strain field is adopted to accommodate the moderately large rotations of the beam. The finite element beam model is verified by comparing the present numeric...
Statestream: a Toolbox to Explore Layerwise-Parallel Deep Neural Networks
Building deep neural networks to control autonomous agents which have to interact in real-time with the physical world, such as robots or automotive vehicles, requires a seamless integration of time into a network's architecture. The central question of this work is how the temporal nature of reality should be reflected in the execution of a deep neural network and its components. Most artific...
Multi-Prediction Deep Boltzmann Machines
We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks ...
Weihrauch-completeness for layerwise computability
Layerwise computability is an effective counterpart to continuous functions that are almost-everywhere defined. This notion was introduced by Hoyrup and Rojas [17]. A function defined on Martin-Löf random inputs is called layerwise computable if it becomes computable once each input is equipped with some bound on the layer where it passes a fixed universal Martin-Löf test. Interesting examples of...
Local Behavior of Discretely Stiffened Composite Plates and Cylindrical Shells
The Layerwise Shell Theory is used to model discretely stiffened laminated composite plates and cylindrical shells for stress, vibration, pre-buckling and post-buckling analyses. The layerwise theory reduces a 3-D problem to a 2-D problem by expanding the 3-D displacement field as a function of a surface-wise 2-D displacement field and a 1-D interpolation polynomial through the shell thickness....
Publication date: 2018